Results 1 - 20 of 858
1.
Sensors (Basel) ; 24(9)2024 Apr 25.
Article in English | MEDLINE | ID: mdl-38732857

ABSTRACT

This study presents a pioneering approach that leverages advanced sensing technologies and data processing techniques to enhance the process of clinical documentation generation during medical consultations. By employing sophisticated sensors to capture and interpret various cues such as speech patterns, intonations, or pauses, the system aims to accurately perceive and understand patient-doctor interactions in real time. This sensing capability allows for the automation of transcription and summarization tasks, facilitating the creation of concise and informative clinical documents. Through the integration of automatic speech recognition sensors, spoken dialogue is seamlessly converted into text, enabling efficient data capture. Additionally, deep models such as Transformer models are utilized to extract and analyze crucial information from the dialogue, ensuring that the generated summaries encapsulate the essence of the consultations accurately. Despite encountering challenges during development, experimentation with these sensing technologies has yielded promising results. The system achieved a maximum ROUGE-1 metric score of 0.57, demonstrating its effectiveness in summarizing complex medical discussions. This sensor-based approach aims to alleviate the administrative burden on healthcare professionals by automating documentation tasks and safeguarding important patient information. Ultimately, by enhancing the efficiency and reliability of clinical documentation, this innovative method contributes to improving overall healthcare outcomes.
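
As a point of reference for the ROUGE-1 score reported above, the following minimal Python sketch computes unigram-overlap ROUGE-1 F1 between a reference and a candidate summary; the example sentences are hypothetical and the function is a generic illustration of the metric, not the study's evaluation code.

```python
from collections import Counter

def rouge1_f1(reference: str, summary: str) -> float:
    """Unigram-overlap ROUGE-1 F1 between a reference and a candidate summary."""
    ref_counts = Counter(reference.lower().split())
    sum_counts = Counter(summary.lower().split())
    overlap = sum((ref_counts & sum_counts).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(sum_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical consultation note vs. model-generated summary
print(rouge1_f1("patient reports persistent cough for two weeks",
                "patient has persistent cough for two weeks"))
```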


Subject(s)
Deep Learning , Humans , Speech Recognition Software
2.
J Acoust Soc Am ; 155(5): 3060-3070, 2024 May 01.
Article in English | MEDLINE | ID: mdl-38717210

ABSTRACT

Speakers tailor their speech to different types of interlocutors. For example, speech directed to voice technology has different acoustic-phonetic characteristics than speech directed to a human. The present study investigates the perceptual consequences of human- and device-directed registers in English. We compare two groups of speakers: participants whose first language is English (L1) and bilingual L1 Mandarin-L2 English talkers. Participants produced short sentences in several conditions: an initial production and a repeat production after a human or device guise indicated either understanding or misunderstanding. In experiment 1, a separate group of L1 English listeners heard these sentences and transcribed the target words. In experiment 2, the same productions were transcribed by an automatic speech recognition (ASR) system. Results show that transcription accuracy was highest for L1 talkers for both human and ASR transcribers. Furthermore, there were no overall differences in transcription accuracy between human- and device-directed speech. Finally, while human listeners showed an intelligibility benefit for coda repair productions, the ASR transcriber did not benefit from these enhancements. Findings are discussed in terms of models of register adaptation, phonetic variation, and human-computer interaction.


Subject(s)
Multilingualism , Speech Intelligibility , Speech Perception , Humans , Male , Female , Adult , Young Adult , Speech Acoustics , Phonetics , Speech Recognition Software
3.
IEEE J Transl Eng Health Med ; 12: 382-389, 2024.
Article in English | MEDLINE | ID: mdl-38606392

ABSTRACT

Acoustic features extracted from speech can help with the diagnosis of neurological diseases and monitoring of symptoms over time. Temporal segmentation of audio signals into individual words is an important pre-processing step needed prior to extracting acoustic features. Machine learning techniques could be used to automate speech segmentation via automatic speech recognition (ASR) and sequence to sequence alignment. While state-of-the-art ASR models achieve good performance on healthy speech, their performance significantly drops when evaluated on dysarthric speech. Fine-tuning ASR models on impaired speech can improve performance in dysarthric individuals, but it requires representative clinical data, which is difficult to collect and may raise privacy concerns. This study explores the feasibility of using two augmentation methods to increase ASR performance on dysarthric speech: 1) healthy individuals varying their speaking rate and loudness (as is often used in assessments of pathological speech); 2) synthetic speech with variations in speaking rate and accent (to ensure more diverse vocal representations and fairness). Experimental evaluations showed that fine-tuning a pre-trained ASR model with data from these two sources outperformed a model fine-tuned only on real clinical data and matched the performance of a model fine-tuned on the combination of real clinical data and synthetic speech. When evaluated on held-out acoustic data from 24 individuals with various neurological diseases, the best performing model achieved an average word error rate of 5.7% and a mean correct count accuracy of 94.4%. In segmenting the data into individual words, a mean intersection-over-union of 89.2% was obtained against manual parsing (ground truth). It can be concluded that emulated and synthetic augmentations can significantly reduce the need for real clinical data of dysarthric speech when fine-tuning ASR models and, in turn, for speech segmentation.
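
The intersection-over-union figure quoted above compares predicted word boundaries with manually annotated ones. A minimal sketch of how such an interval IoU can be computed is shown below; the time stamps are hypothetical and this is not the study's own evaluation script.

```python
def interval_iou(pred, truth):
    """Intersection-over-union of two (start, end) time intervals in seconds."""
    start = max(pred[0], truth[0])
    end = min(pred[1], truth[1])
    intersection = max(0.0, end - start)
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - intersection
    return intersection / union if union > 0 else 0.0

# Hypothetical ASR-aligned word boundary vs. manual parsing (ground truth)
print(interval_iou((0.42, 0.88), (0.40, 0.90)))  # ~0.92
```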


Subject(s)
Speech Perception , Speech , Humans , Speech Recognition Software , Dysarthria/diagnosis , Speech Disorders
4.
PLoS One ; 19(4): e0302394, 2024.
Article in English | MEDLINE | ID: mdl-38669233

ABSTRACT

Digital speech recognition is a challenging problem that requires learning complex signal characteristics such as frequency, pitch, intensity, timbre, and melody, which traditional methods often struggle to capture. This article introduces three solutions based on convolutional neural networks (CNNs): 1D-CNN is designed to learn directly from the raw digital signal, while 2DS-CNN and 2DM-CNN have more complex architectures that transform the raw waveform into images using the Fourier transform before learning essential features. Experimental results on four large datasets of 30,000 samples each show that the three proposed models outperform well-known models such as GoogLeNet and AlexNet, with best accuracies of 95.87%, 99.65%, and 99.76%, respectively. With 5-10% higher performance than other models, the proposed solution demonstrates the ability to learn features effectively, improve recognition accuracy and speed, and open up the potential for broad applications in virtual assistants, medical recording, and voice commands.
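
For readers unfamiliar with learning directly from raw waveforms, the sketch below shows a generic 1D-CNN classifier in PyTorch. The layer sizes, number of classes, and input length are assumptions for illustration only and do not reproduce the 1D-CNN, 2DS-CNN, or 2DM-CNN architectures described in the article.

```python
import torch
import torch.nn as nn

# Minimal 1D-CNN over raw waveform samples (illustrative only; all sizes are
# assumptions, not the paper's design).
class Waveform1DCNN(nn.Module):
    def __init__(self, n_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, stride=4), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):                # x: (batch, 1, samples)
        h = self.features(x).squeeze(-1)
        return self.classifier(h)

logits = Waveform1DCNN()(torch.randn(2, 1, 16000))  # two 1-second 16 kHz clips
print(logits.shape)                                  # torch.Size([2, 10])
```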


Subject(s)
Neural Networks, Computer , Speech Recognition Software , Humans , Speech/physiology , Algorithms
5.
Sensors (Basel) ; 24(8)2024 Apr 17.
Article in English | MEDLINE | ID: mdl-38676191

ABSTRACT

This paper addresses a joint training approach applied to a pipeline comprising speech enhancement (SE) and automatic speech recognition (ASR) models, where an acoustic tokenizer is included in the pipeline to transfer linguistic information from the ASR model to the SE model. The acoustic tokenizer takes the outputs of the ASR encoder and provides a pseudo-label through K-means clustering. To transfer the linguistic information, represented by pseudo-labels, from the acoustic tokenizer to the SE model, a cluster-based pairwise contrastive (CBPC) loss function is proposed; this self-supervised contrastive loss is combined with an information noise contrastive estimation (infoNCE) loss function. The combined loss prevents the SE model from overfitting to outlier samples and captures the pronunciation variability among samples with the same pseudo-label. The effectiveness of the proposed CBPC loss is evaluated on a noisy LibriSpeech dataset by measuring both speech quality scores and the word error rate (WER). The experimental results reveal that the proposed joint training approach using the CBPC loss achieves a lower WER than conventional joint training approaches. In addition, the speech quality scores of the SE model trained with the proposed approach are higher than those of the standalone SE model and of SE models trained with conventional joint training approaches. An ablation study investigating different combinations of loss functions shows that the CBPC loss combined with infoNCE contributes to a reduced WER and an increase in most of the speech quality scores.
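
The CBPC loss itself is specific to this paper, but the infoNCE component it is combined with follows the standard contrastive formulation. The sketch below shows a generic InfoNCE loss in PyTorch with cosine similarity and an assumed temperature of 0.07; it illustrates the loss family, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, temperature=0.07):
    """Generic InfoNCE: pull anchor toward its positive, push away from negatives.
    anchor, positive: (D,); negatives: (K, D). Temperature is an assumption."""
    anchor = F.normalize(anchor, dim=0)
    pos_sim = torch.dot(anchor, F.normalize(positive, dim=0)) / temperature
    neg_sim = F.normalize(negatives, dim=1) @ anchor / temperature
    logits = torch.cat([pos_sim.unsqueeze(0), neg_sim])      # positive at index 0
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))

loss = info_nce(torch.randn(128), torch.randn(128), torch.randn(10, 128))
print(loss.item())
```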


Subject(s)
Noise , Speech Recognition Software , Humans , Cluster Analysis , Algorithms , Speech/physiology
6.
J Affect Disord ; 355: 40-49, 2024 Jun 15.
Article in English | MEDLINE | ID: mdl-38552911

ABSTRACT

BACKGROUND: Prior research has associated spoken language use with depression, yet studies often involve small or non-clinical samples and face challenges in the manual transcription of speech. This paper aimed to automatically identify depression-related topics in speech recordings collected from clinical samples. METHODS: The data included 3919 English free-response speech recordings collected via smartphones from 265 participants with a depression history. We transcribed speech recordings via automatic speech recognition (Whisper tool, OpenAI) and identified principal topics from transcriptions using a deep learning topic model (BERTopic). To identify depression risk topics and understand the context, we compared participants' depression severity and behavioral (extracted from wearable devices) and linguistic (extracted from transcribed texts) characteristics across identified topics. RESULTS: From the 29 topics identified, we identified 6 risk topics for depression: 'No Expectations', 'Sleep', 'Mental Therapy', 'Haircut', 'Studying', and 'Coursework'. Participants mentioning depression risk topics exhibited higher sleep variability, later sleep onset, and fewer daily steps and used fewer words, more negative language, and fewer leisure-related words in their speech recordings. LIMITATIONS: Our findings were derived from a depressed cohort with a specific speech task, potentially limiting the generalizability to non-clinical populations or other speech tasks. Additionally, some topics had small sample sizes, necessitating further validation in larger datasets. CONCLUSION: This study demonstrates that specific speech topics can indicate depression severity. The employed data-driven workflow provides a practical approach for analyzing large-scale speech data collected from real-world settings.
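
A minimal sketch of the kind of transcription-plus-topic-modeling pipeline described above is given below, assuming the openai-whisper and BERTopic Python packages; the file names and the min_topic_size setting are hypothetical, and a realistic run would need many more transcripts than shown.

```python
import whisper                      # openai-whisper package
from bertopic import BERTopic

# Transcribe free-response recordings with Whisper (file paths are hypothetical)
model = whisper.load_model("base")
paths = ["rec_001.wav", "rec_002.wav"]          # in practice: thousands of files
transcripts = [model.transcribe(p)["text"] for p in paths]

# Cluster transcripts into topics with BERTopic
topic_model = BERTopic(min_topic_size=5)        # assumed setting
topics, probs = topic_model.fit_transform(transcripts)
print(topic_model.get_topic_info())
```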


Subject(s)
Deep Learning , Speech , Humans , Smartphone , Depression/diagnosis , Speech Recognition Software
7.
Unfallchirurgie (Heidelb) ; 127(5): 374-380, 2024 May.
Article in German | MEDLINE | ID: mdl-38300253

ABSTRACT

BACKGROUND: Time is a scarce resource for physicians. One routine medical task is the request for radiological diagnostics. This process is characterized by high administrative complexity and sometimes considerable time consumption. Measures that provide administrative relief in favor of patient care have so far been lacking. AIM OF THE STUDY: Process optimization of the request for radiological diagnostics. As a proof of concept, requests for radiological diagnostics were placed using a mobile, smartphone- and tablet-based application with dedicated voice recognition software in the Department of Trauma Surgery at the University Hospital of Würzburg (UKW). MATERIAL AND METHODS: In a prospective study, the time differences and efficiency of the mobile app-based method (ukw.mobile-based application, UMBA) were compared with the PC-based method (PC-based application, PCBA) for requesting radiological services. The time from indication to completed request and the time required to create the request on the device were documented and assessed. Due to the non-normal distribution of the data, a Mann-Whitney U test was performed. RESULTS: The time from indication to completed request was significantly (p < 0.05) shorter with UMBA than with PCBA (PCBA: mean ± standard deviation [SD] 19.57 ± 33.24 min, median 3.00 min, interquartile range [IQR] 1.00-30.00 min vs. UMBA: 9.33 ± 13.94 min, median 1.00 min, IQR 0.00-20.00 min). The time to complete the request on the device was also significantly shorter with UMBA (PCBA: mean ± SD 63.77 ± 37.98 s, median 51.96 s, IQR 41.68-68.93 s vs. UMBA: 25.21 ± 11.18 s, median 20.00 s, IQR 17.27-29.00 s). CONCLUSION: The mobile, voice-assisted request process leads to a considerable time reduction in daily clinical routine and illustrates the potential of user-oriented, targeted digitalization in healthcare. In the future, the process will be supported by artificial intelligence.
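
The group comparison described above relies on a Mann-Whitney U test. The sketch below shows how such a test can be run with SciPy on hypothetical request times; the numbers are invented for illustration and are not the study's data.

```python
from scipy.stats import mannwhitneyu

# Hypothetical request times (minutes) for the PC-based and app-based workflows
pcba_minutes = [3.0, 1.5, 30.0, 45.0, 2.0, 12.0, 28.0, 1.0]
umba_minutes = [1.0, 0.5, 20.0, 9.0, 1.0, 4.0, 15.0, 0.5]

stat, p_value = mannwhitneyu(pcba_minutes, umba_minutes, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.3f}")
```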


Subject(s)
Mobile Applications , Humans , Wounds and Injuries/diagnostic imaging , Wounds and Injuries/surgery , Germany , Prospective Studies , Computers, Handheld , Smartphone , Traumatology , Speech Recognition Software , Teleradiology/instrumentation , Teleradiology/methods , Acute Care Surgery
8.
Adv Sci (Weinh) ; 11(17): e2309826, 2024 May.
Article in English | MEDLINE | ID: mdl-38380552

ABSTRACT

Speech recognition is becoming increasingly important in modern society, especially for human-machine interactions, but its deployment is still severely hampered by the difficulty machines have recognizing voiced commands in challenging real-life settings: oftentimes, ambient noise drowns out the acoustic signal, and walls, face masks, or other obstacles hide the mouth motion from optical sensors. To address these formidable challenges, an experimental prototype of a microwave speech recognizer empowered by a programmable metasurface is presented here that can remotely recognize human voice commands and speaker identities even in noisy environments and when the speaker's mouth is hidden behind a wall or face mask. The programmable metasurface is the pivotal hardware ingredient of the system because its large aperture and huge number of degrees of freedom allow the system to perform a complex sequence of sensing tasks, orchestrated by artificial-intelligence tools. Relying solely on microwave data, the system avoids visual privacy infringements. It is experimentally demonstrated that the developed microwave speech recognizer can enable privacy-respecting, voice-commanded human-machine interactions in many important but to-date inaccessible application scenarios. The presented strategy will unlock new possibilities for future smart homes, ambient-assisted health monitoring, and intelligent surveillance and security.


Subject(s)
Microwaves , Speech Recognition Software , Humans
9.
JASA Express Lett ; 4(2)2024 Feb 01.
Article in English | MEDLINE | ID: mdl-38350077

ABSTRACT

Measuring how well human listeners recognize speech under varying environmental conditions (speech intelligibility) is a challenge for theoretical, technological, and clinical approaches to speech communication. The current gold standard, human transcription, is time- and resource-intensive. Recent advances in automatic speech recognition (ASR) systems raise the possibility of automating intelligibility measurement. This study tested four state-of-the-art ASR systems with second-language speech-in-noise and found that one, Whisper, performed at or above human listener accuracy. However, the content of Whisper's responses diverged substantially from human responses, especially at lower signal-to-noise ratios, suggesting both opportunities and limitations for ASR-based speech intelligibility modeling.


Subject(s)
Speech Perception , Humans , Speech Perception/physiology , Noise/adverse effects , Speech Intelligibility/physiology , Speech Recognition Software , Recognition, Psychology
10.
Radiol Artif Intell ; 6(2): e230205, 2024 Mar.
Article in English | MEDLINE | ID: mdl-38265301

ABSTRACT

This study evaluated the ability of generative large language models (LLMs) to detect speech recognition errors in radiology reports. A dataset of 3233 CT and MRI reports was assessed by radiologists for speech recognition errors. Errors were categorized as clinically significant or not clinically significant. Performances of five generative LLMs (GPT-3.5-turbo, GPT-4, text-davinci-003, Llama-v2-70B-chat, and Bard) were compared in detecting these errors, using manual error detection as the reference standard. Prompt engineering was used to optimize model performance. GPT-4 demonstrated high accuracy in detecting clinically significant errors (precision, 76.9%; recall, 100%; F1 score, 86.9%) and not clinically significant errors (precision, 93.9%; recall, 94.7%; F1 score, 94.3%). Text-davinci-003 achieved F1 scores of 72% and 46.6% for clinically significant and not clinically significant errors, respectively. GPT-3.5-turbo obtained 59.1% and 32.2% F1 scores, while Llama-v2-70B-chat scored 72.8% and 47.7%. Bard showed the lowest accuracy, with F1 scores of 47.5% and 20.9%. GPT-4 effectively identified challenging errors of nonsense phrases and internally inconsistent statements. Longer reports, resident dictation, and overnight shifts were associated with higher error rates. In conclusion, advanced generative LLMs show potential for automatic detection of speech recognition errors in radiology reports. Keywords: CT, Large Language Model, Machine Learning, MRI, Natural Language Processing, Radiology Reports, Speech, Unsupervised Learning. Supplemental material is available for this article.
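
The reported F1 scores follow directly from the stated precision and recall values; the short check below reproduces the GPT-4 figures (a generic calculation, not code from the study).

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Reproduce the reported GPT-4 F1 scores from its precision and recall
print(round(f1(0.769, 1.000), 3))   # ~0.869 -> 86.9% (clinically significant)
print(round(f1(0.939, 0.947), 3))   # ~0.943 -> 94.3% (not clinically significant)
```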


Subject(s)
Camelids, New World , Radiology Information Systems , Radiology , Speech Perception , Animals , Speech , Speech Recognition Software , Reproducibility of Results
11.
Sci Rep ; 14(1): 313, 2024 01 03.
Article in English | MEDLINE | ID: mdl-38172277

ABSTRACT

Tashlhiyt is a low-resource language with respect to acoustic databases, language corpora, and speech technology tools, such as Automatic Speech Recognition (ASR) systems. This study investigates whether a method of cross-language re-use of ASR is viable for Tashlhiyt from an existing commercially-available system built for Arabic. The source and target language in this case have similar phonological inventories, but Tashlhiyt permits typologically rare phonological patterns, including vowelless words, while Arabic does not. We find systematic disparities in ASR transfer performance (measured as word error rate (WER) and Levenshtein distance) for Tashlhiyt across word forms and speaking style variation. Overall, performance was worse for casual speaking modes across the board. In clear speech, performance was lower for vowelless than for voweled words. These results highlight systematic speaking mode- and phonotactic-disparities in cross-language ASR transfer. They also indicate that linguistically-informed approaches to ASR re-use can provide more effective ways to adapt existing speech technology tools for low resource languages, especially when they contain typologically rare structures. The study also speaks to issues of linguistic disparities in ASR and speech technology more broadly. It can also contribute to understanding the extent to which machines are similar to, or different from, humans in mapping the acoustic signal to discrete linguistic representations.
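
Word error rate, one of the two transfer metrics used above, is the word-level Levenshtein distance normalized by the reference length. A minimal sketch is shown below; the example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Hypothetical reference vs. ASR hypothesis
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # ~0.167
```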


Subject(s)
Speech Perception , Humans , Language , Linguistics , Speech , Speech Recognition Software
12.
Stud Health Technol Inform ; 310: 124-128, 2024 Jan 25.
Article in English | MEDLINE | ID: mdl-38269778

ABSTRACT

Creating notes in the EHR is one of the most problematic aspects of clinical work for health professionals. The main challenges are the time spent on this task and the quality of the records. Automatic speech recognition technologies aim to facilitate clinical documentation for users, optimizing their workflow. In our hospital, we internally developed an automatic speech recognition (ASR) system to record progress notes in a mobile EHR. The objective of this article is to describe the pilot study carried out to evaluate the implementation of ASR for recording progress notes in a mobile EHR application. As a result, the specialty that used ASR the most was Home Medicine. The lack of access to a computer at the point of care and the need to write brief progress notes quickly were the main reasons users adopted the system.


Subject(s)
Documentation , Speech Recognition Software , Humans , Pilot Projects , Health Personnel , Hospitals
13.
Int J Med Inform ; 178: 105213, 2023 10.
Article in English | MEDLINE | ID: mdl-37690224

ABSTRACT

PURPOSE: Considering the significant workload of nursing tasks, enhancing the efficiency of nursing documentation is imperative. This study aimed to evaluate the effectiveness of a machine learning-based speech recognition (SR) system, implemented in a psychiatry ward, in reducing the clinical workload associated with typing nursing records. METHODS: The study was conducted between July 15, 2020, and June 30, 2021, at Cheng Hsin General Hospital in Taiwan. The language corpus was based on existing records from the hospital nursing information system. The participating ward's nursing activities, clinical conversations, and accent data were also collected for deep learning-based SR-engine training. A total of 21 nurses participated in the evaluation of the SR system. Documentation time and recognition error rate were evaluated in parallel between SR-generated records and keyboard entry over 4 sessions. Any differences between SR and keyboard transcriptions were regarded as SR errors. FINDINGS: A total of 200 record entries were obtained from the four evaluation sessions: at each session, 10 participants used SR and keyboard entry in parallel, and 5 entries were collected from each participant. Overall, the SR system processed 30,112 words in 32,456 s (0.928 words per second). The mean accuracy of the SR system improved after each session, from 87.06% in the 1st session to 95.07% in the 4th session. CONCLUSION: This pilot study demonstrated that our machine learning-based SR system has acceptable recognition accuracy and may reduce the burden of documentation for nurses. However, potential errors in SR transcription should continually be recognized and corrected. Further studies are needed to improve the integration of SR into digital documentation of nursing records, in terms of both productivity and accuracy, across different clinical specialties.


Subject(s)
Speech Recognition Software , Speech , Humans , Pilot Projects , Perception , Documentation
15.
Article in English | MEDLINE | ID: mdl-37603475

ABSTRACT

Automatic Speech Recognition (ASR) technologies can be life-changing for individuals with dysarthria, a speech impairment that affects the articulatory muscles and results in unintelligible speech. Nevertheless, the performance of current dysarthric ASR systems is unsatisfactory, especially for speakers with severe dysarthria, who would benefit most from this technology. While transformer and neural attention-based sequence-to-sequence ASR systems have achieved state-of-the-art results in converting healthy speech to text, their application as dysarthric ASR remains underexplored due to the complexities of dysarthric speech and the lack of extensive training data. In this study, we addressed this gap and proposed our Dysarthric Speech Transformer, which uses a customized deep transformer architecture. To deal with data scarcity, we designed a two-phase transfer learning pipeline to leverage healthy speech, investigated neural freezing configurations, and utilized audio data augmentation. Overall, we trained 45 speaker-adaptive dysarthric ASR models in our investigations. Results indicate the effectiveness of the transfer learning pipeline and data augmentation and emphasize the significance of deeper transformer architectures. The proposed ASR outperformed the state of the art and delivered better accuracy for 73% of the dysarthric subjects whose speech samples were employed in this study, with improvements of up to 23%.
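
The neural freezing configurations mentioned above refer to keeping some pre-trained layers fixed during fine-tuning. The sketch below illustrates this idea on a generic PyTorch transformer encoder; the layer counts and dimensions are assumptions and do not reflect the paper's architecture.

```python
import torch.nn as nn

# Minimal sketch of a freezing configuration for transfer learning: keep the
# lower encoder layers (trained on healthy speech) fixed and fine-tune only the
# upper layers on dysarthric data. All sizes are illustrative assumptions.
encoder_layer = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

for layer in encoder.layers[:4]:          # freeze the first 4 layers
    for param in layer.parameters():
        param.requires_grad = False

trainable = sum(p.numel() for p in encoder.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")
```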


Subject(s)
Dysarthria , Speech , Humans , Speech Recognition Software , Speech Disorders , Learning
16.
Sensors (Basel) ; 23(13)2023 Jun 29.
Article in English | MEDLINE | ID: mdl-37447886

ABSTRACT

This paper proposes a speech recognition method based on a domain-specific language speech network (DSL-Net) and a confidence decision network (CD-Net). The method involves automatically training on a domain-specific dataset, using pre-trained model parameters for transfer learning, and obtaining a domain-specific speech model. Importance sampling weights were set for the trained domain-specific speech model, which was then integrated with the speech model trained on the benchmark dataset. This integration automatically expands the lexical content of the model to accommodate the input speech based on the lexicon and language model. The adaptation addresses the out-of-vocabulary words that are likely to arise in most realistic scenarios and utilizes external knowledge sources to extend the existing language model. By doing so, the approach enhances the adaptability of the language model to new domains or scenarios and improves the prediction accuracy of the model. For domain-specific vocabulary recognition, a deep fully convolutional neural network (DFCNN) and a connectionist temporal classification (CTC)-based approach were employed to achieve effective recognition of domain-specific vocabulary. Furthermore, a confidence-based classifier was added to enhance the accuracy and robustness of the overall approach. In the experiments, the method was tested on a proprietary domain audio dataset and compared with an automatic speech recognition (ASR) system trained on a large-scale dataset. Experimental verification showed that the model improved accuracy from 82% to 91% in the medical domain. The inclusion of domain-specific datasets yielded a 5% to 7% improvement over the baseline, while the introduction of model confidence further improved the baseline by 3% to 5%. These findings demonstrate the value of incorporating domain-specific datasets and model confidence in advancing speech recognition technology.
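
The CTC objective referenced above aligns frame-level acoustic predictions with label sequences without explicit segmentation. The sketch below shows a generic use of PyTorch's CTC loss with assumed tensor shapes; it illustrates the loss, not the DSL-Net/CD-Net system itself.

```python
import torch
import torch.nn as nn

# Illustrative CTC loss for acoustic-model training (all shapes are assumptions):
# log_probs: (T, N, C) = (time steps, batch, label classes incl. blank at index 0)
T, N, C, S = 50, 2, 28, 10
log_probs = torch.randn(T, N, C).log_softmax(dim=2)
targets = torch.randint(1, C, (N, S), dtype=torch.long)   # label indices, no blanks
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```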


Subject(s)
Models, Theoretical , Neural Networks, Computer , Speech Recognition Software , Speech , Speech Perception , Datasets as Topic , Sound Spectrography
17.
Int J Med Inform ; 176: 105112, 2023 08.
Article in English | MEDLINE | ID: mdl-37276615

ABSTRACT

BACKGROUND: The purpose of this study was to develop an automatic speech recognition (ASR) deep learning model for transcribing clinician-patient conversations in radiation oncology clinics. METHODS: We fine-tuned the pre-trained English QuartzNet 15x5 model for the Korean language using a publicly available dataset of simulated conversations between clinicians and patients. Real conversations between a radiation oncologist and 115 patients in actual clinics were then prospectively collected, transcribed, and divided into training (30.26 h) and testing (0.79 h) sets. These datasets were used to develop the ASR model for clinics, which was benchmarked against other ASR models, including 'Whisper large,' the 'Riva Citrinet-1024 Korean model,' and the 'Riva Conformer Korean model.' RESULTS: The pre-trained English ASR model was successfully fine-tuned and converted to recognize the Korean language, resulting in a character error rate (CER) of 0.17. However, this performance was not sustained on the real conversation dataset. To address this, we further fine-tuned the model, resulting in an improved CER of 0.26. The other ASR models, 'Whisper large,' the 'Riva Citrinet-1024 Korean model,' and the 'Riva Conformer Korean model,' showed CERs of 0.31, 0.28, and 0.25, respectively. On the general Korean conversation dataset 'zeroth-korean,' our model showed a CER of 0.44, while 'Whisper large,' the 'Riva Citrinet-1024 Korean model,' and the 'Riva Conformer Korean model' resulted in CERs of 0.26, 0.98, and 0.99, respectively. CONCLUSION: We developed a Korean ASR model to transcribe real conversations between a radiation oncologist and patients. The performance of the model was deemed acceptable for both specific and general purposes compared with the other models. We anticipate that this model will reduce the time required for clinicians to document patients' chief complaints or side effects.


Subject(s)
Radiation Oncology , Speech Perception , Humans , Speech Recognition Software , Benchmarking , Language , Republic of Korea
18.
Sensors (Basel) ; 23(11)2023 May 30.
Article in English | MEDLINE | ID: mdl-37299935

ABSTRACT

The field of computational paralinguistics emerged from automatic speech processing, and it covers a wide range of tasks involving different phenomena present in human speech. It focuses on the non-verbal content of human speech, including tasks such as spoken emotion recognition, conflict intensity estimation and sleepiness detection from speech, showing straightforward application possibilities for remote monitoring with acoustic sensors. The two main technical issues present in computational paralinguistics are (1) handling varying-length utterances with traditional classifiers and (2) training models on relatively small corpora. In this study, we present a method that combines automatic speech recognition and paralinguistic approaches, which is able to handle both of these technical issues. That is, we trained a HMM/DNN hybrid acoustic model on a general ASR corpus, which was then used as a source of embeddings employed as features for several paralinguistic tasks. To convert the local embeddings into utterance-level features, we experimented with five different aggregation methods, namely mean, standard deviation, skewness, kurtosis and the ratio of non-zero activations. Our results show that the proposed feature extraction technique consistently outperforms the widely used x-vector method used as the baseline, independently of the actual paralinguistic task investigated. Furthermore, the aggregation techniques could be combined effectively as well, leading to further improvements depending on the task and the layer of the neural network serving as the source of the local embeddings. Overall, based on our experimental results, the proposed method can be considered as a competitive and resource-efficient approach for a wide range of computational paralinguistic tasks.
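
The five aggregation statistics named above can be computed directly from a matrix of frame-level embeddings. The sketch below shows one way to do this with NumPy and SciPy; the embedding dimensions and the use of positive activations as a proxy for "non-zero" are assumptions for illustration.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def utterance_features(frame_embeddings: np.ndarray) -> np.ndarray:
    """Aggregate frame-level embeddings (T, D) into one utterance-level vector
    via five statistics: mean, standard deviation, skewness, kurtosis, and the
    ratio of non-zero (here: positive) activations."""
    return np.concatenate([
        frame_embeddings.mean(axis=0),
        frame_embeddings.std(axis=0),
        skew(frame_embeddings, axis=0),
        kurtosis(frame_embeddings, axis=0),
        (frame_embeddings > 0).mean(axis=0),
    ])

# Hypothetical 120-frame utterance with 64-dimensional network embeddings
print(utterance_features(np.random.rand(120, 64)).shape)   # (320,)
```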


Subject(s)
Speech Perception , Speech , Humans , Neural Networks, Computer , Speech Recognition Software , Acoustics
19.
Psychiatry Res ; 325: 115252, 2023 07.
Article in English | MEDLINE | ID: mdl-37236098

ABSTRACT

Natural language processing (NLP) tools are increasingly used to quantify semantic anomalies in schizophrenia. Automatic speech recognition (ASR) technology, if robust enough, could significantly speed up the NLP research process. In this study, we assessed the performance of a state-of-the-art ASR tool and its impact on diagnostic classification accuracy based on an NLP model. We compared ASR to human transcripts quantitatively (word error rate, WER) and qualitatively by analyzing error type and position. Subsequently, we evaluated the impact of ASR on classification accuracy using semantic similarity measures. Two random forest classifiers were trained with similarity measures derived from automatic and manual transcriptions, and their performance was compared. The ASR tool had a mean WER of 30.4%. Pronouns and words in sentence-final position had the highest WERs. The classification accuracy was 76.7% (sensitivity 70%; specificity 86%) using automated transcriptions and 79.8% (sensitivity 75%; specificity 86%) for manual transcriptions. The difference in performance between the models was not significant. These findings demonstrate that using ASR for semantic analysis is associated with only a small decrease in accuracy in classifying schizophrenia compared to manual transcripts. Thus, combining ASR technology with semantic NLP models qualifies as a robust and efficient method for diagnosing schizophrenia.
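
The classification step described above is a standard random forest trained on similarity features. The sketch below shows an analogous setup with scikit-learn on synthetic data; the feature matrix and labels are random stand-ins, not the study's data, so the resulting score is at chance level.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic semantic-similarity features standing in for the measures derived
# from automatic vs. manual transcriptions (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))            # 100 speakers, 8 similarity measures
y = rng.integers(0, 2, size=100)         # 0 = control, 1 = schizophrenia

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())   # chance-level on random data
```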


Subject(s)
Schizophrenia , Speech Perception , Humans , Semantics , Speech Recognition Software , Natural Language Processing , Schizophrenia/complications , Schizophrenia/diagnosis , Speech
20.
Article in English | MEDLINE | ID: mdl-37030692

ABSTRACT

Dysarthric speech recognition helps speakers with dysarthria communicate more effectively. However, collecting dysarthric speech is difficult, so machine learning models cannot be trained sufficiently on dysarthric speech alone. To further improve the accuracy of dysarthric speech recognition, we proposed a Multi-stage AV-HuBERT (MAV-HuBERT) framework that fuses the visual and acoustic information of dysarthric speech. In the first stage, we use a convolutional neural network model to encode motor information by incorporating all facial speech-function areas, rather than relying solely on lip movement as in traditional audio-visual fusion frameworks. In the second stage, we use the AV-HuBERT framework to pre-train the recognition architecture that fuses the audio and visual information of dysarthric speech. The knowledge gained by the pre-trained model is applied to address the overfitting problem of the model. Experiments based on UASpeech were designed to evaluate the proposed method. Compared with the baseline method, the best word error rate (WER) of our method was reduced by 13.5% on moderate dysarthric speech. For mild dysarthric speech, our method achieved the best result, with a WER of 6.05%. Even for extremely severe dysarthric speech, our method achieved a WER of 63.98%, which is 2.72% and 4.02% lower than the WERs of wav2vec and HuBERT, respectively. The proposed method can effectively reduce the WER of dysarthric speech recognition.


Subject(s)
Dysarthria , Speech Perception , Humans , Speech , Speech Recognition Software , Neural Networks, Computer , Speech Intelligibility